Project 1 MTH 5320

David Nieves-Acaron

13 October 2021

Problem description in report (What are you trying to do? What is your data? What shape is it?)

Introduction

This project was in part inspired by a Veritasium video detailing how a video goes viral or, more generally, achieves a certain number of views. Framed as a neural network problem, the task is to give the network certain inputs and have it predict the number of views, or a function of the number of views. In a primitive sense, then, I am trying to predict how many views a video will get using a feed-forward neural network. The data is a set of features from a set of YouTube videos. The total data has shape (Videos x Features): about 63,938 videos and 12 features were collected, giving it a (63,938 x 12) shape.

However, it must be kept in mind that this is not how the input data is handled when entering it into the neural network. The specific shapes used for inputs will be discussed when showcasing the classification code.

My hypothesis on what makes a video viral is that there are five main components to it:

  1. Thumbnail
  2. Title
  3. Channel
  4. YouTube Algorithm Magic
  5. The actual content of the video

Given the limitations imposed by this project, I had to make do with some aspects of 3 and 5. Part of the reason why I chose to focus on data collection for this project is because I would like to build on top of the effort realized in this project to be able to perform inference on other parts of the video, such as the Thumbnail, and the Title, as we learn about other types of neural networks such as CNNs.

Data Preparation

The data consisted of the following features:

$$\begin{bmatrix} game \ year\\ game \ name\\ likes\\ views\\ dislikes\\ \frac{likes}{dislikes}\\ \frac{views}{dislikes}\\ \frac{likes}{subscribers}\\ uploadDate\\ channel \end{bmatrix}$$

for a selection of about 10,000 videos from each channel (or fewer if the channel had fewer than that, or if something impeded the data collection). The collection process made use of the YouTube Data API as well as of the ELK stack (ElasticSearch, LogStash, and Kibana) in order to search through the collected data effectively. The ELK stack was hosted on AWS (Amazon Web Services) in order to simplify the setup process.

Data collection took three parts: gathering a list of channel IDs, gathering the video IDs for each channel, and uploading the per-video data to ElasticSearch.

The parts will now be explained in more detail.

First Part

I decided to gather some YouTube channels, big and small, in order to represent the different aspects of the gaming community. I specifically chose to focus on individuals who run channels and who primarily showcase themselves playing games. This is opposed to something like IGN, where, although it features gaming, I feel that views come primarily from people wanting news about new releases rather than anything else.

Among some of the YouTube gaming channel "types" that I tried to emphasize when collecting channel IDs were:

Second Part

Gathering a list of video IDs required the use of the YouTube Data API, which imposes a quota on its developers. Specifically, one is allotted 10,000 units per day, and different operations cost different numbers of units. The following table shows the cost associated with each operation. (YouTube Data API Quota Calculator)

Quota costs

| resource | method | cost |
| --- | --- | --- |
| activities | list | 1 |
| captions | list | 50 |
| | insert | 400 |
| | update | 450 |
| | delete | 50 |
| channelBanners | insert | 50 |
| channels | list | 1 |
| | update | 50 |
| channelSections | list | 1 |
| | insert | 50 |
| | update | 50 |
| | delete | 50 |
| comments | list | 1 |
| | insert | 50 |
| | update | 50 |
| | markAsSpam | 50 |
| | setModerationStatus | 50 |
| | delete | 50 |
| commentThreads | list | 1 |
| | insert | 50 |
| | update | 50 |
| guideCategories | list | 1 |
| i18nLanguages | list | 1 |
| i18nRegions | list | 1 |
| members | list | 1 |
| membershipsLevels | list | 1 |
| playlistItems | list | 1 |
| | insert | 50 |
| | update | 50 |
| | delete | 50 |
| playlists | list | 1 |
| | insert | 50 |
| | update | 50 |
| | delete | 50 |
| search | list | 100 |
| subscriptions | list | 1 |
| | insert | 50 |
| | delete | 50 |
| thumbnails | set | 50 |
| videoAbuseReportReasons | list | 1 |
| videoCategories | list | 1 |
| videos | list | 1 |
| | insert | 1600 |
| | update | 50 |
| | rate | 50 |
| | getRating | 1 |
| | reportAbuse | 50 |
| | delete | 50 |
| watermarks | set | 50 |
| | unset | 50 |

One key detail, painfully discovered while trying to collect data, was that searching (using the /search endpoint and then matching the channel ID) is very expensive (100 units out of the 10,000-unit daily quota), and I would run into the problem of running out of units. However, how would one be able to list the videos from a channel if not by searching for them? The answer is through playlists, for which listing is much cheaper. Additionally, one strange trick I found was that a default playlist of user uploads (obtained by replacing the second character of the user's channel ID with 'U') could be used to cheaply list all the uploads of a channel, as opposed to relying on whatever playlists the channel owner has made (usually none, or a much more limited scope than the entire set of uploads).
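The trick above can be sketched as a small helper; the function name is my own, but the "UC" → "UU" convention is exactly what the text describes, and the comments work through the quota arithmetic that motivated it:

```python
def uploads_playlist_id(channel_id: str) -> str:
    """Derive the auto-generated 'uploads' playlist ID from a channel ID.

    Channel IDs begin with 'UC'; replacing the second character with 'U'
    yields the 'UU...' ID of the channel's uploads playlist.
    """
    return channel_id[0] + "U" + channel_id[2:]

# Rough quota arithmetic for listing ~10,000 videos from one channel,
# at up to 50 results per page (so ~200 pages either way):
#   search.list        -> 200 pages * 100 units = 20,000 units (2 daily quotas!)
#   playlistItems.list -> 200 pages *   1 unit  =    200 units
```

The derived `UU...` ID can then be fed to `playlistItems.list` in place of a search query.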

The following contains the code used to collect the video IDs from the selected list of channel IDs. Please note that it is not perfect, as some video IDs failed, which is why I added a try/except statement. Additionally, there is no real fear of duplicates, since later on, when processing the data before feeding it into the NN, the video IDs are put into a set.

DataCollectionWorking.py

Third Part

The following shows the code used to iterate through the video-ID lists for each channel and upload the data to ElasticSearch.

UploadToES.py
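The core of the upload step can be sketched as follows. This is not the actual contents of UploadToES.py; it is a minimal illustration assuming the `elasticsearch-py` client, with the index name and endpoint as placeholders:

```python
def build_actions(index_name, videos):
    """Turn per-video stat dicts into ElasticSearch bulk-index actions.

    Using the video ID as the document _id means re-uploading the same
    video simply overwrites the existing document, so accidental
    duplicates in the collection step are harmless.
    """
    for video in videos:
        yield {
            "_index": index_name,
            "_id": video["videoId"],
            "_source": video,
        }

# Hypothetical usage with the elasticsearch-py client (endpoint is a placeholder):
#   from elasticsearch import Elasticsearch, helpers
#   es = Elasticsearch("https://<your-aws-es-endpoint>:443")
#   helpers.bulk(es, build_actions("youtube-videos", channel_videos))
```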

ElasticSearch Setup

The setup of the ELK stack (sometimes simply referred to as ElasticSearch) was done by creating an AWS account and following the setup instructions. One caveat: both from the perspective of a college student and in the interest of fairness, I only used free-tier limits for this ELK setup, which meant two t3.small instances with 20 GiB of EBS volumes in total. I chose ElasticSearch because of the powerful visualizations it provides through Kibana. Here are a few examples:

Benchmarking

In order to speed up the process of training, I downloaded a copy of the data locally by querying ElasticSearch. One might think that this defeats the point of having a database, but the true value of ElasticSearch lies in its searching capabilities, its visualizations, as well as the fact that I have all the data in one easily accessible cloud-hosted location where I only have to collect the data once from YouTube and be done with it.

For one of the most important input features, I wrote a function that generates a one-hot encoding of which game is being played in the video. Note that I only used the top 15 categories, including the "N/A" category, which covers off-topic videos that may discuss video games or something video-game-related but do not necessarily have a video game entry on the video's page, as seen below:
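The encoding described above amounts to something like the following sketch (the function name and example game list are my own, not the report's actual code):

```python
def one_hot_game(game, top_games):
    """One-hot encode a video's game against the top-category list.

    `top_games` holds the most frequent game names; anything else
    (including videos with no game entry at all) falls into the final
    "N/A" slot, so the vector has len(top_games) + 1 entries.
    """
    vec = [0.0] * (len(top_games) + 1)
    idx = top_games.index(game) if game in top_games else len(top_games)
    vec[idx] = 1.0
    return vec
```

With the report's top-15 setup, 14 named categories plus the "N/A" slot give a 15-dimensional input vector.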

As for processing and normalizing the inputs, I started by taking several ratios, such as

"l/d" $\frac{likes}{dislikes}$,

"v/s" or $\frac{views}{subscribers}$,

as well as non-ratio features such as the raw number of subscribers, log-transforming them and then dividing by the max value in order to normalize. I sanity-checked my values by printing out the entry to which the max value belonged. For example, the max subscriber count in this whole list belongs to the user PewDiePie, which makes sense.
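A minimal sketch of that normalization, assuming a base-10 log transform (the report does not state the base, only that the inputs end up logarithmic):

```python
import math

def log_max_normalize(values):
    """Log-transform a positive-valued feature, then scale by its max.

    Heavy-tailed counts like raw subscriber numbers are compressed with
    a log before dividing by the column maximum, so every input lands
    in roughly [0, 1].
    """
    logged = [math.log10(v + 1) for v in values]  # +1 guards against log(0)
    peak = max(logged)
    return [x / peak for x in logged]

# With a 100M-subscriber channel in the column, it maps to 1.0
# and a 1M-subscriber channel maps to about 0.75.
```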

Then, I use a one-layer neural net (essentially multi-class linear regression) as a benchmark, to gauge how much the more complicated architecture of a neural network, along with its associated hyperparameters, improves the results.
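To make the benchmark concrete, here is a single-output sketch of the idea in plain Python; the report's actual model is multi-class, and the learning rate and epoch count below are arbitrary sketch values:

```python
def train_one_layer(xs, ys, lr=0.5, epochs=5000):
    """Benchmark model: a single linear unit, w * x + b, no hidden layers.

    Training it with gradient descent on mean squared error is exactly
    linear regression, which is what makes it a useful baseline for
    judging what the hidden layers add.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# On synthetic data drawn from y = 2x + 0.5, this recovers w ~ 2, b ~ 0.5.
```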

As can be seen, the benchmark already does decently on the dataset, but we will see that adding more layers (thus turning it from linear regression into the realm of neural networks) improves the results.

For now we will perform some experiments with the learning rate. Perhaps a more aggressive learning rate of $\alpha = 0.1$ will result in more accurate results.

Based on this result, as well as previous results from other assignments, I will continue to use learning rates of 0.005 or even smaller. Let me try 0.001.

This one did about the same as $\alpha = 0.005$, but given its inferior results on the 4-class classification, I will proceed with $\alpha = 0.005$. Next I will try varying the network architecture by doubling the size of the first hidden layer. With any luck, this should predict more accurately.

That only did slightly better than the input, 16, 16, output configuration seen above. Perhaps decreasing the batch size can help?

Perhaps a happy medium of batch size = 8 can be found?

Clearly, the batch size of 16 does better. Perhaps increasing the batch size to 32 can yield better results?

This probably has resulted in the gradient exploding, so the learning rate will be reduced.

Using an even lower learning rate to further compensate for the large gradient...

Based on this, and on the diminishing returns, it seems safe to say that a batch size of 16 is the best for what is necessary here. Next I will try using the sigmoid activation (and its derivative in backpropagation) instead of ReLU to see if better results can be achieved.

This is speculation, but it seems like it could be the case that the reason why it does better with sigmoid is because of the fact that the activation for sigmoid makes use of $e$, and that many of the inputs are logarithmic.

Finally, wrapping things up and using the training set on this last setup, we get:

Thus, we get pretty good results in the end. The pattern observed is probably due to the neural network figuring out that by multiplying the views/subscriber ratio by the subscriber count (or rather, by adding, since they are logarithms), one can get a "baseline" number of views, given that most subscribers, especially for big channels, will watch the video. Thus, if a channel has about 1M subscribers and a views/subscriber ratio of about 1.5, it can likely predict that the video will easily get about 1.5M views as a baseline (maybe more, maybe less). Of course, this is only speculation.

Conclusion

The best results were obtained with a learning rate of $\alpha = 0.005$, a batch size of 16, and sigmoid activations.

As I mentioned before, the focus of this project was data collection, which is why I would like to build on top of the effort realized here in order to make more accurate predictions using more important features such as thumbnails, title content, etc., using the more sophisticated neural network types covered later in the course.

One key note to consider about my approach to views is that it is rather naive in that it does not take into account the role of the YouTube algorithm. What this means is that while videos with interesting or eye-catching thumbnails, titles, and channels will generally be more likely to get views, there is still an element of spontaneity introduced by the YouTube algorithm. This is especially true when one considers the changes that have been made to it over the years in order to satisfy different objectives such as:

Although not an algorithm change, further regulation has been introduced as part of an effort to reduce radicalization and borderline content.

Thank you section

I would like to give credit to Corey Schafer for his helpful guide and code sample on how to use the YouTube Data API in Python (specifically for generating the access code).

Collection process

Picture of the computers used at Evans Hall to facilitate the collection process.